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Abstract 

We present a rejection method based on recursive covering of the 
probability density function with equal tiles. The concept works for any 
probability density function that is pointwise computable or representable 
by tabular data. By the implicit construction of piecewise constant ma- 
jorizing and minorizing functions that are arbitrarily close to the density 
function the production of random variates is arbitrarily independent of 
the computation of the density function and extremely fast. The method 
works unattended for probability densities with discontinuities (jumps and 
poles). The setup time is short, marginally independent of the shape of the 
probability density and linear in table size. Recently formulated require- 
ments to a general and automatic non-uniform random number generator 
are topped. We give benchmarks together with a similar rejection method 
and with a transformation method. 

1 Introduction and background 

This article introduces a setup method for a rejection algorithm for the produc- 
tion of random numbers with an arbitrary probability density functions (PDF) 
with finite support and that is at least pointwise computable. The key feature 
is fast production of random variates in computational applications and simple 
applicability to any probability density with any number of modes. The prin- 
ciple consists of covering the surface under the PDF with equal tiles for which 
no a priori information is required. The speed of random number production 
is arbitrarily independent of the shape and computational cost to evaluate the 
PDF. Prior to the introduction of the method we give a brief review of the 
subject, some existing methods and terminology. 



The two topics of uniform and non-uniform random number generators 
(RNG) are rather disjoint, with little overlap in the literature and the asso- 
ciated communities. This is not surprising as the respective problems are quite 
distinct and can be considered as subsequent tasks. A non-uniform RNG re- 
quires a uniform one, usually employed as a black box, and the quality of the 
former depends on the quality of the latter. Contrary to what one might have ex- 
pected, the past 15 years have seen a considerable development of uniform RNGs 
(URNGs) . For almost half a century since the beginning of the age of computing 
in the 1950s, uniform random numbers have been produced with linear congru- 
cntial generators, which are based on the recurrence relation Ij+i = alj + c( 
mod to). After shortcomings began to be noticed in the 1960s, much effort went 
into reducing them by tuning the parameters, especially a and to. However it 
took decades until alternative algorithms were discovered, and then over a dozen 
appeared within a short time, with the possibility of improving the quality by 
combining different methods: the XOR shift URNG [28 , the multiply with 
carry URNG, the linear feedback shift register URNG, etc. For reviews, see 

Refe. USDS! El E2]. 

Fast generation of non-uniform random numbers is important in e.g. Monte 
Carlo simulations [H [2l [HI [13 EZ1 E2J [23j [31]. Statistical theory shows how 
one can produce random variates for any meaningful distribution. Nevertheless, 
intelligent mathematical but also purely computational methodology was de- 
veloped to achieve speed for well known analytical and invertible distributions, 
for non-analytic but transformable distributions and also for empirical PDFs 
that only exist as tabular data. In the earlier days of computing any progress 
was taken very seriously [26] |27j [29] [30] in applied mathematics but also more 
recently new perspectives on seemingly converged methodology on, for example, 
Gaussian distributed random numbers can be found [TH1 HOI HH ■ 

A plethora of mathematically involved publications was inspired by the prac- 
titioner's need to increase the speed and quality of non-uniform random number 
production in applications of statistical computing. Another big driving force is 
simplicity of application. Special requirements of initialisation for example are 
a nuisance. Each context and application provides different and often opposite 
challenges. For example the famous Ziggurat method by Marsaglia [3T] and 
its implementation by Marsaglia and Tsang 32] is a non-truncating method 
within the narrow class of symmetric, strictly decreasing, analytic and invert- 
ible densities. Other algorithms existed long before, but the method's appeal 
is that the specific implementation is even faster than any other that is spe- 
cialized entirely on exponential or normal distributed numbers. The required 
initial data structure setup depends on parameters that so far have been pub- 
lished only for the exponential and normal distributions [32] and are difficult 
to derive automatically [40 . A more general approach is taken by Ahrens pQ. 
This well-known method is able to process any tabular data that fulfills few 
restrictions on smoothness, but the setup and production of random numbers is 
slower. The latter two methods are related to more general strip or slice meth- 
ods — already existing for a long time [7] [33 [3U1 H3 [3S] — but are of much 
higher importance in computational applications. It should be kept in mind 
that information on the speed of a method is only meaningful with respect to 
a particular implementation and hardware. Some famous methods are actually 
specialized implementations that rely on the cache memory of contemporary 
processors. 
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There are many collections of specialized non-uniform random number gen- 
erators, usually more than one for a particular class of PDFs, each of them 
consisting of tailored code. The generators can be categorized into two classes: 
a) A setup of some data structure is carried out before the first random vari- 
ate is drawn, and b) a setup is not needed, e.g. with PDFs for which inversion 
methods exist. Clearly, any "universal" method will need a setup, as explained 
well for instance in Ref. [24 . In statistical computing the user taps from these 
collections of software choosing a particular generator. Ideally one can also 
choose gradually between large setup time and fast generation of random num- 
bers or fast setup and slow generation. The drawbacks of such collections can 
be huge codes, each bug prone, and the increasingly intractable specialties of 
the requirements for the setup. Furthermore, the classes of available distribu- 
tions are limited in the end. The ultimate goal is a universal, easy to use and 
also fast black box generator [24] . The definition and limitations of a universal 
random number generator, however, is often imprecise in many publications. 
For example, it does not have any built-in knowledge of the PDF, except that a 
meaningful PDF can be known only in as much as the number of modes is not 
infinite [2 and that the modes are not infinitely thin, i.e. meaningless delta-like 
functions. For universal applicability all transformation methods drop out be- 
cause they cannot be found automatically by a general algorithm. Therefore, in 
the general case only some approximation like pointwise data will be available 
to represent the PDF, implying that the support of the distribution cannot be 
infinite cither. Any computational approach subdues to this restriction. If the 
PDF can only be evaluated at horrendous costs and no inversion method exists, 
then no procedure will be able to produce usefully random numbers. Thus, we 
can assume that the evaluation cost of the PDF for the appropriate number of 
points is within a similar order of magnitude as the overall task within which 
the random numbers are used, e.g. a Monte Carlo calculation. 

In some cases of PDFs with infinite support truncation of the probability 
density can be justified with statistical negligibility of the tails. In few cases of 
distributions with infinite support and where the PDF is at best pointwise com- 
putable as with the Levy distribution [361 137). which we use as a benchmark, 
transformation methods that do sample the infinite range were found [5J, at 
least within overflow limitations. Such heavy-tailed distributions deserve atten- 
tion if truncated early unless justified (or required) by the application [31 151 12"5]: 
however, this aspect is largely left undiscussed in the literature. State of the art 
methods, e.g. Refs. [7j [TTJ [T21 [TTJ [HI [53], construct in the setup phase a ma- 
jorizing (or envelope or comparison) function and usually also a minorizing (or 
squeeze) function (see Sec. 2.2 for an introduction) with secants or other segmen- 
tations to be used within a rejection technique with look-up tables. In addition 
to truncation, in many methods it is often also required to know the approxi- 
mate or even exact location of the mode (of which mostly only one is allowed) , 
while the method is still declared to be suitable for "arbitrary PDFs" [51 120j. 
More limitations to "general methods" are explained in Ref. jS], where it is 
argued that the modes of the PDF must be known beforehand for certain tech- 
niques to be suitable for "arbitrary densities" . The same authors also construct 
"general algorithms" that depend on concavity properties and analyticity of 
the density. It is common in several methods, e.g. adaptive procedures for log- 
concave distributions to use the value of maximum density explicitly in the 
setup of the approximation of the comparison function [7J. Another example 
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is the transformed density method [3T], that employs a strictly monotonically 
increasing differentiable transform such that the transformed PDF is concave. 
This method is also considered universal. In the improved ratio of uniforms 
method [20] the setup is restricted to certain classes of distributions if the re- 
quired transform of variables must yield a region that can be sampled efficiently 
In some cases one often resorts to a rejection technique and the concept of 
squeeze functions, as also employed in this paper, to improve the situation. Yet 
another general and adaptive approach constructs a polynomial approximation 
of the inverse distribution function in X = i^ _1 (J7) that is stored in tables to 
be used for interpolation during production [IS]. For the class of log-concave 
distributions Ref. [TU] introduces piecewise exponentials for the approximation 
of the majorizing and minorizing functions using previously sampled points of 
the density function. This method however is dedicated to the context of Gibbs 
sampling where each variate is usually drawn from different densities. Finally, 
closing the topic of piecewise approximation, we mention the approximation of 
arbitrary densities via a mixture of simpler densities. An interesting example 
is the triangular approximation giving a piecewise linear approximation of the 
target density via many overlapping triangle densities, which recently was also 
implemented in hardware |41j . The most generally applicable method published 
so far that is also fast is due to Ahrens [TJ [2] . It can deal with more than one 
mode whithout a priori information on the modes. 

Overall we find inevitably that methods titled "universal" , "automatic" , 
"black box" , "out of the box" and combinations thereof clearly cannot sample 
an infinite support of the PDF, are restricted to certain classes of density func- 
tions or are not automatic out of the box. But this verdict is much less restrictive 
in realistic applications of statistical computing where the requested distribu- 
tions can always be represented for example in terms of (interpolated) pointwise 
data or other approximations to any computationally sensible accuracy and fi- 
nite support limits. We stress that approximations of the density function via 
previous sampling of points or the approximation of the inverse distribution 
function F~ l as mentioned above are like other similar concepts a common 
approach in the field of non-uniform random numbers [7J [IS]. It is accepted 
to truncate the originally infinite support, if statistically justified. Moreover, 
if the PDF is given as an arbitrarily accurate approximation, the location of 
extrema can always be determined within any required accuracy in finite time. 
Our method is applicable in this context and belongs to the type of rejection 
and segmentation methods as the ones by Ahrens [T] and Marsaglia [35]. In this 
context and for our claim we can therefore continue to speak of arbitrary PDFs 
since this nomenclature is widely accepted within the literature. Therefore, in 
the quest for a universal random number generator, a method can be called uni- 
versal if it can process arbitrary finite density function data without any further 
information. Techniques that take finite samples of the desired distribution and 
then try to match this distribution [5D] are not discussed here. 

Lately easy applicability has become more important. The initial motiva- 
tion for this work was to overcome the setup difficulty of the Ziggurat method 
for the general symmetric monotonic decreasing case. Eventually, we developed 
a simpler method for a significantly more general class of PDFs, at the cost 
of a moderate performance penalty due to larger memory requirement as com- 
pared to the original Ziggurat implementation of Ref. |32j and to specialized 
transformation methods, were available. 
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With the intention to provide a quintessence of recent research and demands 
of statistical computing a wish-list of requirements to a universal random num- 
ber generator was presented in Ref. |24j . which we quote here: 

1. Only one piece of code, debugged only once. 

2. By a simple parameter choose between fast setup and slow generation or 
long setup time and fast generation. 

3. It can sample from truncated distributions. 

4. The rejection rate can be made as close to zero as desired, i.e. as close to 
inversion as one wants. 

5. The setup time is independent of the density function and is faster than 
many specialized generators. 

6. The quality of the non-uniform random numbers is as good as the under- 
lying uniform random numbers. 

Point 5 should be made more precise. It refers to the independence of the 
shape of the density function, but a density with complicated shape usually 
requires more information, in particular in regions of high curvature. If the 
input size increases, the setup time is indeed allowed to grow. There is no 
obvious universal measure that gives the minimum input size required for the 
suitable representation of a function. This is responsibility of the scientist. 
Several of the methods mentioned in the above overview are considered to meet 
these requirements. This is also the case with the method presented here plus 
additional relaxations with respect to the properties of the PDF. 

In Sec. [2] the tiling is introduced along with some numerical considerations, 
an explanation of the role of the squeeze function, and a proof of correctness. 
Sec. [3] shows that the tiling procedure is capable of dealing unattended with 
poles and other discontinuities. Sec.|4]gives benchmarks of typical and a-typical 
situations. Sec. [5] summarizes and provides a short discussion. We chose for 
comparison well-known methods for specialized distributions (Gaussian and ex- 
ponential) but also a difficult non-analytic distribution (Appendix) for which 
a transformation method is available. We explain computational issues that 
are usually ignored but are decisive for performance. This serves in placing the 
tiling method into the right context among other methods with respect to speed 
and applicability. 

2 The tiling and numerical considerations 
2.1 The tiling procedure 

For any computational task a PDF with finite support, even one with a non- 
invertible distribution for which no specialized method exists, can be represented 
either as a) a sufficiently good approximation that can be evaluated sufficiently 
fast, e.g. by series expansion or polynomials, or b) as tabular data for interpola- 
tion. Any feature of a meaningful PDF f(x) can be represented in the latter case 
by varying the sampling density of the tabular data that represents the PDF in 
the form (x±, f(xi)), (x2, /(a^)), • • • However, since the tiling is completely 
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independent of such considerations, we will simply speak of "evaluating /(a;)". 
Furthermore, it can safely be assumed that the PDF can be evaluated in a finite 
time comparable to the duration of the application within which the random 
variates are to be used. With these prerequisites the determination of local 
extrema is achievable in O(N) where N is the number of data points. 

For the following considerations the integral over the density function is 
required only up to a constant factor C — f{x)dx. The rejection method 
does not require C = 1. The tiling concept is simple, see Fig.[l] The area under 
the PDF f(x), x <G [a,b], is covered with rectangular tiles of equal area. The 
procedure starts from one single tile b— a wide and max(/(a;)) high. Choosing an 
initial tile larger than required by the support and the maximum did not show 
significant influence on the outcome in all cases we tested. The initial tile is split 
into four equal tiles, and so on recursively At each refinement cycle all tiles arc 
split. Those that lie entirely above the PDF are discarded in each cycle. The 
splitting can be stopped once a given accuracy of the covering is reached; Sec. |2.2| 
explains the details of the calculation of this condition. Fig. [TJshows a truncated 
asymmetric Levy PDF with parameters a = 1,/?= 0.7,7 = 1 } 5 = according 
to the So-parametrization convention |36[ 137] . We use this distribution as an 
arbitrary example for comparisons that provides fat tails and for which a fast 
transformation method is available; for details see Sec. [6] The support is chosen 
small to produce deliberately a visible truncation. Fig. [2] shows the tiling of 
a bimodal PDF. The above recursive procedure may be considered the most 
elegant and simple way to construct the tiling. For the subsequent production 
stage it is irrelevant however if the tiling was constructed, for example, by 
plastering, i.e. starting from a small initial tile somewhere within the support. 

Thus, the tiling constructs a piecewise constant majorizing function g(x) of 
the PDF f(x), with g(x) > f(x) Vx <E [a, b]. The closer g(x) to f(x), the better. 
The universal von Neumann rejection method has two main steps: 

a) Generate a random X G [a, b] ~ g(x) and a random uniform Y € [0, g(X)]. 

b) Accept X if Y < f(X), otherwise reject it and repeat the procedure. 

The rejection rate is given by the ratio R of the areas under the PDF and the 
comparison function: 



The denominators correspond to the sum over all N tile surfaces S which are 
equal. 

At this point one could think that an adaptive scheme would be more ap- 
propriate, e.g. only tiles intersected by the PDF are split or even deformed to 
fit the boundary better, or tiles lying below the PDF are merged. Indeed this 
is common practice in computer graphics and some approximation methods. 
However, this measure to save memory is not recommended here; actually, it is 
to be avoided for the sake of simplicity and speed. In the production stage the 
probability of random selection of a tile would have to be proportional to its 
area to guarantee uniform probing. This is more complicated and slower espe- 
cially if the area ratios are not integer. Moreover, a uniform random coordinate 
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Figure 1 : For the intuitive introduction of the tiling procedure the plot shows 
an early refinement stage in the tiling of a truncated asymmetric Levy PDF 
with parameters used as an arbitrary example. 



is more expensive to produce in shapes other than rectangles. Additional de- 
tails to why the segmentation into equal areas is crucial are given in numerous 
publications [TJ 13 [311 122] • Instead of using strips of different height and width 
or even different shape as in other methods, here we suggest equal tiles as the 
clearly simplest and fastest approach. This is the key idea in this work. 

Now the von Neumann rejection can be implemented with a modified first 
step: 

a) Generate a random tile index i = 1, . . . , N; generate a random coordinate 
(X,Y) within tile i. 

b) Accept X if Y < f(X), otherwise reject it and repeat the procedure. 

This way we are able to sample efficiently the majorizing function g(x). More- 
over, the evaluation of the condition in b) is hugely sped up using the implicitly 
constructed minorizing function as explained in more detail in the following 
Sec. O 

Although the sampling with tiles seems sufficiently intuitive and equivalent 
to analogous methods of this kind [311 E2] we give nevertheless a reasoning on 
the correctness. 

Theorem The introduced sampling of the comparison function is equivalent 
to the standard von Neumann sampling, i.e. g(x) is sampled uniformily within 
all tiles generated on the support x G [a, b]. 

Proof Define I = {ii, Z2, is, ij\r} the set of tile indices and Ij C I the 
subset of all indices ij,^,... corresponding to a particular tile column j with 
width Ax = (b — a)/r, where r is the number of columns. Thus (J - Ij = I. 
Construct an bijective mapping i, — > nj with n; < Tty+j.. The mapping is 
purely a renaming of indices in column j. So we have nj — 1, . . . , n 3 masl , = 
g(x)/Ay where Ay is the height of the tile. Note that within column j the 
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Figure 2: Tiling of a bimodal probability density function. f(x) is composed of 
two Levy density functions with parameters a = 1, /? = 0.7, 7=1 (left part) 
and a — 2, (3 = 1, 7 = 1 (right part); the heights are adjusted to fit the curves 
seamlessly at x = 10. 



function g{x) is constant. Now define a random number Y 3 — nj u Ay with 
uniform random u G [0,1). The index n\ is random by the random choice 
of i J k and the subsequent mapping. Then Y 3 S [0,g(x)) is a uniform random 
number in column j and we arrive at the standard situation of the rejection 
method for the interval x £ Axj : Generate a uniform coordinate (X 3 , Y 3 ) with 
uniform X 3 £ Axj and reject X 3 if Y 3 > f(x). The sampling of j is implicitly 
proportional to the size of Ij, i.e. the height of column j, due to the uniform 
sampling of tile indices i E I and (J^ Ij = I. Therefore X J is sampled as 
desired according to g{x) and the sampling of pairs (X, Y) is achieved with 

Xe\Jj{x j }^g(x). □ 

The correctness of the standard rejection method can be taken for granted 
since the seminal paper by John von Neumann [43] . 



2.2 Implicit squeeze function 

The tiling also constructs implicitly a so-called squeeze function q(x) that fulfills 
the condition q(x) < f(x) < g{x) < g{x) within the required interval [a, b]. This 
is the usual definition of the squeeze and comparison functions, see for example 
Ref. [17] . q(x) is the upper edge of the top tiles lying completely underneath 
f(x), or equivalently the bottom edge of the tiles intersected by f(x). The role 
of the squeeze function is to reduce the number of evaluations of f(x) if q(x) can 
be evaluated faster: In the setup all tiles below f(x) are labeled and the test 
Y < q(X) involves no computation — just one label look-up. Actually Y must 
not be generated at all for tiles that are not intersected by f(x). The latter is 
the key advantage of the squeeze function. Thus the following modified steps 
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implement the von Neumann rejection: 

a) Generate a random tile with index i = 1, . . . , N; generate a random X 
within tile i. 

b) Look up if tile i is labeled as "< f(x)". If yes, accept X. Otherwise gen- 
erate Y within tile i and compare Y < f(X). If yes, accept X. Otherwise 
reject it and repeat the procedure. 

With dense tiling most X are accepted in b) by one table look-up only without 
the generation of a second real coordinate Y. The PDF itself is hardly ever 
evaluated. The relative number of evaluations of f(x) per non-uniform variate 
is given by 



The integral over the squeeze function is given by the sum of all tile surfaces 
not intersected by f(x). Thus, the number of evaluations of f(x) can be greatly 
reduced and is equal to the area fraction of the border tiles. Both numbers R 
and E are cheaply calculated on the fly, so that the resulting rejection rate can 
be pre-imposed as a condtion for the tile refinement. The latter results will be 
reconsidered in Sec. 12.31 on the distribution cutoff. 

To have a better measure of the "quality" of g{x) and q(x) we estimate an 
upper limit for the probability density pe that f(x) must be evaluated for one 
non-uniform random number. Define Ax := (b — a)/n where n is the number 
of columns, so Ax is simply the final width of the tiles. For Ax <C b — a, i.e n 
sufficiently high, f(x) can be assumed linear in the interval Ax. Then 



This expression is deduced from the ratio of areas contained in a tile column 
corresponding to Y < q(x) and q(x) <Y< g(x) respectively. 

2.3 Distribution cutoffs 

In the introduction and thereafter we explained that all procedures that are 
not specialized to particular analytic and thus invcrtible distributions will never 
sample an infinite support. Considerations on the appropriate cutoff apply only 
to special distributions [7]. If the support of the PDF f(x) is infinite, a general 
algorithm will inevitably reduce it to a reasonable finite interval x £ [a, b]. It is 
the scientist's responsibility to control appropriately these support limits. 

However, the period length L of the [0, 1] -uniform generator used in the 
sampling along the abscissa must satisfy the condition f(x) < 1/LsX both limits 
o, b [2]. This situation appears for example in the standard rejection method or 
the Ziggurat method. In the latter, the [0, l]-uniform generator must sample the 
whole bottom strip. The sampling procedure in our method lifts this limitation 
by the number of columns n — 2 r ~ 1 , where r = 1, 2, . . . is the refinement level: 
A random integer is generated to sample a tile and a subsequent uniform X is 
generated within the tile. In practically relevant cases the number of tiles will 
always be exceedingly smaller than the period length of any sensible random 
integer generator. Fat (or somehow long) tailed distributions deserve attention 
for the above reason. 




(2) 



Pe(x, Ax) oc 



b-a dlog/(x) 
r dx 



(3) 
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3 Discontinuous probability densities 



The literature also considers density functions which contain a pole (which nu- 
merically is indistinguishable from a cusp) |2], i.e. f[x) — > oo as x — > c + or 
x — > c~ in the range of interest [a, 6]. Within the standard von Neumann rejec- 
tion method [43] a pole is dealt with as follows: Choose e< 1 and assign the 
cumulative probability 

Pc = f £ (4) 

./ c— e 

to the interval [c — e, c + e], a so-called mass point. If the [0,l]-uniform deviate 
is smaller than P c return c. Otherwise sample from [a, b]\[c — e, c + e]. If c and 
c ± e have the same numerical representation then no better method exists to 
sample from f(x). Usually this situation must be treated computationally as a 
special case in the setup and production phases. 

The tiling procedure and subsequent production works unchanged with an 
appropriately approximated (or modified) density function as follows. Fig. [3] 
shows the situation of a density function with a pole at x — c. Figure dimensions, 
especially the vertical scale, are exaggerated to convey intuitively the geometry. 
Choose [c — e, c+ e] and modify f(x) yielding f(x) such that the cumulative 
probability P c according to Eq. Q is preserved (hatched area in Fig. [3]): 

+ f(x)dx = / + f(x)dx (5) 



which gives the implicit condition for max(/(a;)): 



1 



c+e 



max(/(ar)) = — J f{x)dx. (6) 

This is the minimum value of the height of the initial tile. If e is chosen suf- 
ficiently small with numerical or/and statistical reasoning, the result will be 
identical to the procedure in the standard von Neumann rejection described 
above. 

One has to be aware that the choice of e as the smallest representable "dis- 
tance" from the position of the pole is unnecessarily restrictive. Any statistical 
verification requires a significant number of deviates to fall in the region of the 
pole to reveal a possibly too large value for e. Depending on the error norm and 
test method it is likely to turn out that e can safely be chosen magnitudes larger 
than the initial numerical consideration. The statistical needs of the application 
must be considered in any case. Thus, there is no generally obvious upper limit 
for e. 

An example application is the scaled symmetric modified Bessel function of 
the second kind Kq(\x\)/tt, which is the density of the product XY, where X 
and Y are idependent normal distributed random numbers. -Ko(M) diverges at 
x = 0. A possible setting could be the following. Restricting the support to 
x e [-15, 15] accounts for over 99.99999% of all mass. With e = 0.00001 we get 
afraction of 8.03978x 10~ 5 of the mass contained in the interval [0— e, 0+e]. The 
number of recursive refinements is given by |~log 2 (30/(2e))] = 21. About 235 000 
tiles are retained to cover the density function, corresponding to two megabytes 
of memory in our data format. Benchmarks for the setup are presented in 
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Figure 3: Probability density containing a pole at x — c (schematic). The 
returned deviates are numerically correct if the modified density function f(x) 
fulfills the condition of equal area (hatched region) for x S [c — e,c+e] and 
e 1 sufficiently small. The initial tile is chosen max(/(i)) high. The true 
or modified PDF can be approximated via data points as shown or by any 
other method that implements this condition. The mathematical jumps in the 
modified PDF can be modeled numerically via two very close consecutive data 
points along x. 

the next section. Although it is quicker to directly multiply normal random 
variates, this example demonstrates the applicability of the method to other 
densities with no simple alternative. 

PDFs with first order discontinuities or jumps are implicitly contained in 
the above case. With a suitable interpolation scheme one can simply use 
tabular data to model the jump from one data point (xi,f(xi)) to the next 
(xj+i, /(xi+i)) and fix {xi+\ — x{) ~ e as close as numerically possible. There 
will be no deviates falling in [xj, Xj+i]. This situation is contained schematically 
in Fig. [3] as well, showing two consecutive but very near data points for the left 
flank of a jump. 

4 Measurements and comparisons 

In general, speed comparisons are not obvious to do and interpret and only 
meaningful with respect to a particular software and hardware implementation. 
With increasing optimization of code, the mathematical description of a method 
and its implementation become inseparable. We must stress this and point to 
technical aspects that are responsible for a speed difference of two orders of 
magnitude, even though the mathematical/ algorithmic description is identical. 
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The issue of this paper is not pure speed, but portability and easy applicability 
in compromise with speed. 

All measurements were performed on a desktop PC with a 2.4 GHz Intel 
Pentium 4 processor using the GNU C++ compiler version 3.2.2 on Red Hat 
Linux. We explain below the importance of using a multi-tasking operating 
system in its standard operation mode with a typical process time slice during 
the measurements. This will almost always be the case with applications in 
statistical computing. 

4.1 Memory requirements 

At the start the graph of the PDF is embedded in one tile. The memory require- 
ments for a Gaussian-like density function and rejection rate below 0.02 [32 is 
never more than a few megabytes. For details on a uni-modal example as in 
Fig. [T] see Tab. [T] for the bimodal case in Fig. [2] see Tab. [2j Only obnoxious 
density functions with fat tails and many sharp peaks require more memory. In 
the tested variations of such extreme cases using multiple peaks and the support 
truncated very far out the memory needed to achieve a rejection rate below 0.02 
did not exceed 10 megabytes (about one million tiles). This happens using two 
numbers to store the coordinates of one tile and is more than acceptable for 
contemporary desktop computers. We skipped entirely memory optimization 
and removal of redundancy since we preferred a clear class structure and simple 
data management. Setup time can be reduced via speed optimized data struc- 
tures, but that typically increases computation time and storage. Just a decade 
ago the above memory requirements were large for a standard desktop computer 
with a few megabytes memory. This may explain why this fairly straightforward 
method has not been proposed before. 

4.2 Speed of random variate production 

With the SHR3 uniform RNG [32] on the above mentioned configuration our 
method produces 2.6 million non-uniform random numbers per second indepen- 
dently of all tested PDFs. In the following we discuss a few pitfalls of speed 
measurement and code execution, and we compare to other methods. The 
benchmarks refer to methods and implementations that appear most similar or 
useful in judging the tiling method. In any case, the comparisons cannot be 
entirely fair since each method has different specialities. 

In rejection methods the speed of random variate production is arbitrarily 
independent of the PDF and its representation, whether by data points or a 
closed formula. The speed depends only on the properties of the comparison and 
squeeze functions. In all our tested examples with tabular data or simple explicit 
density functions the evaluations representing f(x) are negligible at a rejection 
rate below 0.02. Since interpolation or evaluation of density functions is not the 
topic of this paper, we only give as a rule of thumb that evaluations for 1% of the 
produced random numbers is sufficiently low for almost all practically relevant 
densities. The production of one random variate with the desired distribution 
requires at least two uniform random variates as in most methods. Recently a 
method was published that can provide non-uniform variates with 1 + s, s € 
[0,1], uniform variates where s can be made arbitrarily small |24j . However, 
it turns out that in almost all applications the generation of uniform random 
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Table 1: Number of tiles, rejection rate and evaluation rate for the uni-modal 
PDF shown in Fig. [I] but with a larger cutoff at x = ±64. Refinement level 5 is 
shown in Fig. [I] The memory needed to store 49 685 tiles (refinement level 10) 
is ca. 0.4 megabytes. The evaluation rate tells how often f(x) must be evaluated 
per non-uniform random number. 



Refinement Number of 
level r tiles N 



1 


1 


2 


3 


3 


8 


4 


24 


5 


70 


6 


238 


7 


857 


8 


3 246 


9 


12 609 


10 


49 685 


11 


197233 


12 


785 936 



Rejection Evaluation 



rate R 


rate E 


0.813 


1 


0.750 


1 


0.627 


1 


0.502 


0.910 


0.317 


0.650 


0.196 


0.390 


0.108 


0.220 


0.058 


0.110 


0.029 


0.058 


0.015 


0.029 


0.007 


0.013 


0.002 


0.005 



numbers is not the major sink of computer time. It is up to the scientist to 
evaluate the trade-off between a few percent gain in overall speed and quality 
of the obtained variatcs. The use of less than two uniform random variates 
per non-uniform variates in the context of a rejection technique but also the 
importance of uniform random number quality, in particular in the Ziggurat 
implementation by Marsaglia and Tsang, is commented in Refs. [21 OH IIS1 13"B] . 
Some constructive remarks on the Ziggurat implementation in Ref. |32j can be 
found in Ref. [Mj. 

We chose as one of the benchmarks the symmetric Levy a-stable distribution. 
It is a generalization of the Gaussian distribution, that is recovered for a — 
2; see Appendix. The transformation method by Chambers et al. [3] is the 
contemporary method of choice. As opposed to other published methods it 
has no accuracy deficiency, it does not truncate the support and is sufficiently 
fast for most applications. Moreover, it is applicable to asymmetric Levy a- 
stable deviates too. We use an implementation in C++ for the purpose of this 
comparison. It is about 3 times faster than our method on the above mentioned 
test configuration. 

We also compared to the most efficient implementation of the Ziggurat 
method |32j for exp(— x) and exp(— x 2 ) distributed variates. This implemen- 
tation is considered the fastest for these two distributions. The exponential and 
normal densities could be wired into the code exploiting their mathematical 
properties and using inline coding. In the limit of a negligible rejection rate, 
this Ziggurat implementation could produce 232 million variates per second. 
This means one variate per 10 CPU clock cycles! It is important to note that 
this number could only be achieved if executed alone without any other code, 
for example within a Monte Carlo application. This speed may be surprising 
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Table 2: Statistics for the bimodal PDF shown in Fig. [2] where refinement 
level 6 is plotted. The memory needed to store 151 068 tiles for refinement level 
11 is ca. 1.2 megabytes. 



Refinement Number of 
level r tiles N 



1 


1 


2 


4 


3 


9 


4 


23 


5 


70 


G 


213 


7 


718 


8 


2 602 


9 


9 859 


10 


38 324 


11 


151068 


12 


599 819 



Rejection Evaluation 



rate R 


rate E 


0.858 


1 


0.858 


1 


0.747 


1 


0.605 


0.956 


0.481 


0.871 


0.317 


0.582 


0.189 


0.356 


0.106 


0.195 


0.056 


0.104 


0.029 


0.053 


0.014 


0.025 


0.007 


0.011 



at first sight since the rejection principle is quite similar to the tiling method. 
Actually there are profound differences. First and most obviously, the number 
of tiles is not a power of two. Choosing randomly between exactly 2 s or 2 7 
objects is faster if one uses 8 bits of the 32 bit XOR shift RNG as in Ref. [3"2"] . 
Secondly, it is stated self-evidently in Ref. [32] that small code is important. 
This purely technical issue is hardly ever explained in the literature on random 
numbers despite being highly technical on several occasions. Numerical litera- 
ture [3pJ Chap. 7] finally picks up this issue and also more recently in Ref. [ID] , 
but only briefly say why small code is important. We outline the situation. 

CPUs use hierarchical memory to speed up computation. The access to the 
internal cache memory is magnitudes faster than to the external main memory. 
However, the code and data fitting into this cache is not the only condition for 
faster execution. An algorithm hard-wired in the CPU transfers repeatedly and 
frequently used sections of memory into the cache and also considers the size 
and distribution of the data over the memory banks. A good implementation 
(and compiler) therefore tries to minimize cache misses by arranging data of 
subsequent memory accesses into the same cache line. The latter are sequences 
of bytes transferred into the cache with each memory access. This statistics 
is disrupted by cache misses that are also provoked by a process switch of the 
operating system at built-in time intervals or other events. Small code might 
therefore end up in the cache for a significant time. Very large code that accesses 
its data in random fashion as it is the case in the sampling with tiles will not be 
able to exploit properly cache memory. We can therefore say that the execution 
of code is subject to decisive factors of hardware, compilation and operating 
system that can usually not be controlled entirely. It is also known that CPU- 
specific compilers are able to produce code that can be several times faster than 
a more generic compiler. 

On our typical configuration of operating system and compiler the execution 
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of the Ziggurat code [32] is the fastest by far. The speed factor of ca. 100 to 
our code is in fact consistent to the latency of low-level memory as compared to 
second-level cache of contemporary hardware. This speed difference is leveled 
out considerably if the code and tables of the Ziggurat implementation is forced 
to leave the cache by executing some arbitrary and larger code alternatingly 
with calls to the Ziggurat generator. This measure creates a more realistic use 
case and reduces the execution speed of the Ziggurat code by a factor of ca. 
50. A more rigorous analysis of code and hardware interplay would require the 
exact reproduction of the original test environment which is not readily available 
anymore. 

Finally we make a few more technical remarks and comparisons. The im- 
plementation of the tiling method is only moderately optimized, but completely 
portable and uses throughout Standard Template Library arrays. The period of 
the XOR shift RNG is considered short with 32 bit arithmetic but modification 
to higher models is possible. Following the results in Rcfs. [8, 34, 38 on quality, 
resolution and portability we recommend a slower and also portable uniform 
RNG. Refs [3J [El H5J El 138] also comment other problems of the XOR shift 
RNG in conjunction with the Ziggurat method. The Ziggurat method requires 
for the decision whether to evaluate the density function one coordinate com- 
parison for each attempt to draw a non-uniform number. Our method requires 
one table look-up only. But this advantage is not enough to compensate the 
disadvantage of a large table and resulting slow memory access. 

For accelerated production of random variates to make sense, their part 
must take up a significant proportion of the overall CPU time. But there is 
hardly anything do-able within the order of 10 clock cycles. Moreover, fast 
production of variates imply that enormous amounts are required. This poses 
very high demands on their quality. The findings above as well as the critical 
publications on the Ziggurat implementation Ref. [32] encourage to analyse 
the appropriateness of extremely fast but medium quality variates. A detailed 
analysis of this issue can be found in Ref. [42] . 

4.3 Speed measurements of the setup 

The setup part in our implementation is not speed-optimized but turned out 
to be sufficiently fast for the production of ca. one million variates and above. 
This includes the extreme examples with more than one mode and a very large 
support. To provide a meaningful time measurement for the setup we subtract 
the cumulative time for the evaluations of f(x). For the presented examples 
we used a standard polynomial interpolation with 7 data points. The setup for 
a typical uni-modal PDF (Gaussian or Levy, the latter with sufficiently wide 
support) with 2 15 data points takes ca. 0.2 seconds plus cumulative 2.1 seconds 
for all evaluations of f(x). The calculation time of the Levy PDF via fast 
Fourier transform for 2 15 points is negligible with only 0.2 seconds. Thus, as a 
rule of thumb, the overall total speed of the setup depends almost entirely on 
the number of evaluations of f(x). With a constant number of data points the 
total speed of the setup increases noticeably only for very unusual multi-modal 
PDFs with many sharp peaks and long tails. 

The setup of the Ziggurat for general symmetric, strictly decreasing, non- 
analytic and safely truncatablc PDFs was attempted in Ref. [15]. This setup, 
our C++ version of the Matlab code from Ref. [15] as well as our own generalized 
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iterative C++ code along the original Ziggurat setup formula [35] is sensitive and 
computationally expensive. For example, a numerical error in the flat regions 
of the tail or in the inversion of the PDF can cause a disturbance which often 
causes a breakdown of the procedure. Precautions to mend this are possible 
but complicate the code further and do not guarantee unattended functionality. 
The empirical parameters needed for the setup of the Ziggurat method are 
an additional difficulty for making the method truly automatic. The setup 
time depends strongly on the given data and the above mentioned empirical 
parameters, and is at least one order of magnitude slower than the tiling. 

5 Discussion and conclusion 

We presented a fast method for automatic generation of random variates with 
arbitrary probability density functions independent of symmetry, number of 
modes, and discontinuities. The only prerequisites are pointwise computability 
and finite support. We also explained that the most general thinkable or univer- 
sal method will require no less but also no more than these two requirements. 
In the introductory overview on some representative methods it is shown that 
many less powerful methods exist that truncate the infinite support for analytic 
density functions with only one mode. The accuracy of our method is exact up 
to the computation of the probability density function and meets any numerical 
demand which includes density functions with poles or cusps without additional 
attention. 

The generation of one non-uniform random variate requires only one random 
integer, one random uniform real, two additions, one multiplication and one 
table look-up (no float comparison) most of the time. This is close to the 
minimum of principally required operations, so that additional speed can only 
come from hardware exploitation or specialized methods. Even for complicated 
density functions the memory requirements arc suitable for any contemporary 
desktop computer. 

These properties are not available in other methods of this kind. We can 
extend the wish list from Sec. [2] to a random number generator by additional 
items: 

8. No need for a priori knowledge about the location of any number of modes 
or discontinuities within the density function. 

9. Only pointwise computability and representability of the density function 
is necessary. 

10. Fast setup time and fast generation of random variates. 

1 1 . The discretization and therefore sampling efficiency is asymptotically exact 
and can be pre-imposed. 

An extension to the multivariate case is simple in principle. It means to 
substitute a two-dimensional tile with a cube or hypercube. The required storage 
for data however increases with a power of the dimension. 
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6 Appendix: Levy density function 



The generation of the pointwise density function of the Levy a-stable distribu- 
tion and numbers is a technicality that we report for completeness. Any density 
would suffice for purpose of proving correctness numerically and there are cer- 
tainly more cumbersome densities in statistical computing but we prefer also to 
have inversion methods at hand to produce random numbers. We choose the 
formula by Chambers et al., 1976 [BJ: 

/ - log m cos </> \ 1- ° sin(q0) 
5 7 ^cos((l-a)0)y cos0 ' 1 ' 

where <fi = ir(ii2 — 1/2); u\, «2 6 (0, 1) are uniformly distributed random num- 
bers, 7 is the scaling parameter, and £ is a Levy distributed random number. 
This expression, however, requires different representations for certain ranges of 
a to reduce numerical error. Direct coding is possible but not recommended. 

The pointwise calculation of the PDF is more difficult. It requires the cal- 
culation of a very slowly converging integral [3B] [37J . In the symmetric case the 
Levy distribution can be defined by 

1 f°° 

L(z;a,j) = - / cxp(-(7g) Q ) cos(gz) dq . (8) 

This is a parade example of a non-analytic density function. It does not possess 
any moments other than the first for a € (1,2] due to the divergence of the 
respective integrals except for a — 2 which is the Gaussian limit. Values for 
L(z), Eq. ([8]), can be computed directly [3(0 [37] or by carrying out the Fourier 
Transform 33J . 
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